On-line learning of a hierarchical learning model is studied by a method from statistical
mechanics. In our model, a student of a simple perceptron learns not directly from a true
teacher but from ensemble teachers, who themselves learn from the true teacher with a
perceptron learning rule. Since the true teacher and the ensemble teachers are expressed as a
non-monotonic perceptron and simple perceptrons, respectively, the ensemble teachers go around
the unlearnable true teacher with the distance between them fixed in an asymptotic steady
state. The generalization performance of the student is shown to exceed that of the ensemble
teachers in a transient state, as was shown in similar ensemble-teachers models. Further, it is
found that moving the ensemble teachers even in the steady state, in contrast to keeping them
fixed, is efficient for the performance of the student.
1. Introduction

Learning is the problem of inferring the underlying rules from a given set of examples, which
consist of input data and the corresponding output data generated by the rules. In practice, the
examples are often supplied inexhaustibly, and learning must then proceed by using each
example just once. Such learning is called on-line learning. In contrast, learning in
which all the examples are presented repeatedly is called off-line or batch learning.

On-line learning as well as off-line learning has been extensively studied by using
statistical-mechanical methods, and many extensions of the on-line learning scheme
have been made in order to improve the generalization performance. Recently, Miyoshi and
Okada and Urakami, Miyoshi and Okada analyzed the generalization performance of a
student supervised by a moving teacher that goes around a fixed true teacher in the framework
of on-line learning using the statistical-mechanical method. In their model, the student is
not directly given the outputs of the true teacher. The moving teacher learns from the true
teacher and provides its output to the student. In this sense, the model is a kind of hierarchical
learning. In ref. 5, the true teacher is a non-monotonic perceptron, while the moving teacher
and the student are simple perceptrons using perceptron learning, which in principle cannot
infer the true teacher completely. The theoretical bound of the generalization error of a
simple perceptron learner has been obtained. In that case, the moving teacher goes around
the true teacher with a fixed distance between them. Interestingly, it turned out that when the
student's learning rate is relatively small, the student's generalization error can temporarily
become smaller than that of the moving teacher, even though the student only uses the examples
from the moving teacher.

Subsequently, Miyoshi and Okada and Utsumi, Miyoshi and Okada analyzed the generalization
performance of an extended model of on-line learning with multiple teachers,
which is called the ensemble-teachers learning model. This model can also be regarded as an
extension of ensemble learning, because the ensemble teachers and the student in the
ensemble-teachers model can be interpreted as the ensemble students and their integrating
mechanism, respectively. In particular, ref. 8 discussed the model in which the true teacher,
the ensemble teachers and the student are all simple perceptrons. In this model the true
teacher and the ensemble teachers are fixed. The student adopts the Hebbian learning or
the perceptron learning as a learning rule and uses examples from the ensemble teachers in
turn or randomly. As a result, it was clarified that the Hebbian learning and the perceptron
learning show qualitatively different behavior. In the Hebbian learning, the
generalization error monotonically decreases during the learning process and its asymptotic
value is independent of the learning rate. The asymptotic value is reduced as the number of
ensemble teachers increases, since the ensemble teachers have more variety in their
representations. On the other hand, in the perceptron learning, the generalization error shows
non-monotonic behavior and exhibits a minimum at a certain step in the learning. The minimum
value of the generalization error decreases as the learning rate decreases and the total
number of teachers increases.

In ref. 5 and ref. 8, it was shown that the generalization error of a student can be
smaller than that of a moving teacher or of fixed ensemble teachers. A direct comparison between
the generalization performance with a fixed teacher and that with a mobile teacher, however, has
not been made. Furthermore, in on-line learning with ensemble teachers it
is not obvious whether the mobility or the multiplicity of the ensemble teachers is effective
for the learning performance of the student. In this paper, we study on-line learning with
ensemble teachers that can move around a true teacher. We discuss a model in which
the fixed true teacher is a non-monotonic perceptron and the ensemble moving teachers and
the student are simple perceptrons. This is a generalized version of the model studied in
ref. 5. When the perceptron learning is adopted as the learning rule of the ensemble teachers,
they go around the true teacher with constant order parameters in the steady state. We then
analyze the generalization performance of the student, which learns from the mobile ensemble
teachers using the Hebbian and the perceptron rules. We also study the model with the
ensemble teachers fixed in their steady state. It is thus clarified that the movement of the
ensemble teachers, in comparison with the fixed ensemble case, significantly improves the
generalization performance of the student in a transient state of the learning process.

The paper is organized as follows. In sec. 2, we introduce the model with the ensemble
moving teachers going around the unlearnable true teacher. In sec. 3, based on the
statistical-mechanical idea, we theoretically derive the ordinary differential equations for the
order parameters and an explicit formula for the generalization error of our model in terms of
the order parameters. In sec. 4, we show the theoretical and numerical results for the
generalization performance of the student with the Hebbian and perceptron rules. The last
section is devoted to our conclusion. In the appendixes, the derivations of the differential
equations discussed in sec. 3 are presented in detail.

2. Model 



In this paper, we consider a true teacher, $K$ ensemble moving teachers and a student,
whose connection weights are expressed as $N$-dimensional vectors $A$, $B_k$ and $J$, respectively,
with $k = 1, 2, \cdots, K$. For simplicity, each component $A_i$ of $A$ with $i = 1, \cdots, N$ is assumed
to be drawn from $\mathcal{N}(0, 1)$ independently and fixed, where $\mathcal{N}(m, \sigma^2)$ denotes the Gaussian
distribution with $m$ and $\sigma^2$ being the mean and the variance, respectively. As an initial condition
of the learning process, each of the components $B^0_{ki}$ and $J^0_i$ of $B^0_k$ and $J^0$ is also assumed
to be drawn from $\mathcal{N}(0, 1)$ independently. The input $x$ is also an $N$-dimensional vector, and each
component $x_i$ is drawn from $\mathcal{N}(0, 1/N)$ independently. Thus, we have

$$\langle A_i \rangle = \langle B^0_{ki} \rangle = \langle J^0_i \rangle = \langle x_i \rangle = 0, \tag{2.1}$$

$$\langle (A_i)^2 \rangle = \langle (B^0_{ki})^2 \rangle = \langle (J^0_i)^2 \rangle = 1, \tag{2.2}$$

and

$$\langle (x_i)^2 \rangle = \frac{1}{N}, \tag{2.3}$$

where $\langle \cdots \rangle$ denotes an average over the Gaussian distribution.

In the statistical mechanics of learning, we are interested in the asymptotic behavior
of $A$, $B_k$ and $J$ in the thermodynamic limit $N \to \infty$. Then, one finds that the norms of the
vectors are

$$\|A\| = \sqrt{N}, \quad \|B^0_k\| = \sqrt{N}, \quad \|J^0\| = \sqrt{N}, \quad \|x\| = 1. \tag{2.4}$$

The norms $\|B_k\|$ and $\|J\|$ of the ensemble moving teachers and the student change during the
learning process from their initial values. The normalized lengths of these vectors are introduced
as $l_{B_k} = \|B_k\|/\|B^0_k\|$ for the ensemble teachers and $l_J = \|J\|/\|J^0\|$ for the student. In the
thermodynamic limit, the direction cosines between these vectors are the relevant extensive
quantities, denoted for $A$ and $B_k$, $A$ and $J$, $B_k$ and $B_{k'}$, and $B_k$ and $J$, respectively, as

$$R_{B_k} = \frac{A \cdot B_k}{\|A\|\,\|B_k\|}, \quad R_J = \frac{A \cdot J}{\|A\|\,\|J\|}, \tag{2.5}$$

$$q_{kk'} = \frac{B_k \cdot B_{k'}}{\|B_k\|\,\|B_{k'}\|}, \quad R_{B_k J} = \frac{B_k \cdot J}{\|B_k\|\,\|J\|}. \tag{2.6}$$

In the present study, we assume that the true teacher is a non-monotonic perceptron and that
the ensemble moving teachers and the student are simple perceptrons. The output of the true
teacher for a given input $x$ is defined by a non-monotonic function

$$o = \mathrm{sgn}\left((A \cdot x - a)\,(A \cdot x)\,(A \cdot x + a)\right) \tag{2.7}$$

with a fixed threshold $a$, while those of the ensemble moving teachers and the student are
simply given by $\mathrm{sgn}(B_k \cdot x)$ and $\mathrm{sgn}(J \cdot x)$, respectively. Here, $\mathrm{sgn}(\cdot)$ is the sign function
defined as

$$\mathrm{sgn}(s) = \begin{cases} +1, & s \ge 0, \\ -1, & s < 0. \end{cases} \tag{2.8}$$

A measure of dissimilarity between the true teacher and the ensemble teachers or the student
is defined by using their outputs as

$$\varepsilon_{B_k} = \Theta\left(-o \cdot \mathrm{sgn}(B_k \cdot x)\right) \tag{2.9}$$

for the $k$th ensemble teacher and

$$\varepsilon_J = \Theta\left(-o \cdot \mathrm{sgn}(J \cdot x)\right) \tag{2.10}$$

for the student, where $\Theta(\cdot)$ is the step function defined as

$$\Theta(s) = \begin{cases} +1, & s \ge 0, \\ 0, & s < 0. \end{cases} \tag{2.11}$$
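
As a concrete illustration of eqs. (2.7)-(2.11), the following minimal Python sketch evaluates the non-monotonic teacher output and the per-example error for weight vectors stored as plain lists of floats. The function names (`sgn`, `dot`, `teacher_output`, `example_error`) are illustrative choices and do not come from the paper.

```python
def sgn(s):
    """Sign function of eq. (2.8): +1 for s >= 0, -1 otherwise."""
    return 1 if s >= 0 else -1


def dot(a, b):
    """Inner product of two equally long lists of floats."""
    return sum(ai * bi for ai, bi in zip(a, b))


def teacher_output(A, x, a):
    """Non-monotonic true-teacher output of eq. (2.7) with threshold a."""
    v = dot(A, x)
    return sgn((v - a) * v * (v + a))


def example_error(o, w, x):
    """Step-function error of eqs. (2.9)-(2.10): 1 if sgn(w.x) disagrees with o, else 0."""
    return 1 if -o * sgn(dot(w, x)) >= 0 else 0
```

With threshold a = 0.5 and a one-dimensional teacher A = [1.0], the output is +1 for v > a and for -a < v < 0, and -1 otherwise, which is the sign pattern that a simple perceptron cannot reproduce.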

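One of the main purposes of statistical learning theory is to obtain theoretically the
generalization errors $\epsilon_{B_k}$ and $\epsilon_J$, which are defined as the averages of the errors
$\varepsilon_{B_k}$ and $\varepsilon_J$ over the whole set of possible inputs $x$. Since the input $x$ appears in
eqs. (2.9) and (2.10) only through the inner products $A \cdot x$, $B_k \cdot x$ and $J \cdot x$, the average over
the Gaussian vector $x$ can be reduced to an average over correlated Gaussian variables. When one
defines a set of variables $v$, $v_{B_k}$ and $u$ as

$$v = A \cdot x, \tag{2.12}$$
$$v_{B_k} l_{B_k} = B_k \cdot x, \tag{2.13}$$
$$u\, l_J = J \cdot x, \tag{2.14}$$

they obey the multivariate Gaussian distribution

$$P(v, \{v_{B_k}\}, u) = \frac{1}{(2\pi)^{(K+2)/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2}\,(v, \{v_{B_k}\}, u)\,\Sigma^{-1}\,(v, \{v_{B_k}\}, u)^{T}\right), \tag{2.15}$$

where $\Sigma$ is the covariance matrix of these variables: its diagonal elements are unity and its
off-diagonal elements are the direction cosines $R_{B_k}$, $R_J$, $q_{kk'}$ and $R_{B_k J}$ of eqs. (2.5) and (2.6).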
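Evaluating the correlated Gaussian integrals, the generalization errors $\epsilon_{B_k}$ and $\epsilon_J$ are
obtained as

$$\epsilon_{B_k} = 2\left(\int_{-a}^{0} + \int_{a}^{\infty}\right) Dv\, H\!\left(\frac{R_{B_k} v}{\sqrt{1 - R_{B_k}^2}}\right) \tag{2.17}$$

and

$$\epsilon_J = 2\left(\int_{-a}^{0} + \int_{a}^{\infty}\right) Dv\, H\!\left(\frac{R_J v}{\sqrt{1 - R_J^2}}\right), \tag{2.18}$$

where $Ds$ is the Gaussian measure defined as

$$Ds = \frac{ds}{\sqrt{2\pi}} \exp\left(-\frac{s^2}{2}\right) \tag{2.19}$$

and $H(\cdot)$ is the error function defined as

$$H(u) = \int_{u}^{\infty} Dx. \tag{2.20}$$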

It should be noted that the learning dynamics enters the generalization errors only through
$R_{B_k}$ and $R_J$. This implies that the generalization errors have a fundamental minimum as
functions of $R_{B_k}$ and $R_J$, irrespective of whether the values of $R_{B_k}$ and $R_J$ that give the
minimum are actually reached under the particular learning rule chosen for the
student and the ensemble teachers. An efficient learning rule might realize the fundamental
minimum for a given learning model.
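
The existence of such a minimum can be checked numerically. The sketch below is a minimal illustration, assuming that the generalization error of a simple perceptron with direction cosine R to the true teacher is 2(∫ from -a to 0 + ∫ from a to ∞) Dv H(Rv/√(1-R²)); it evaluates this expression with a composite Simpson rule and scans R on a grid. The helper names, the integration cutoff and the grid spacing are arbitrary choices made for this example.

```python
import math


def H(u):
    """H(u): integral of the standard Gaussian measure from u to infinity."""
    return 0.5 * math.erfc(u / math.sqrt(2.0))


def gauss(v):
    """Standard Gaussian density."""
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)


def simpson(f, lo, hi, n=400):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0


def gen_error(R, a, cutoff=8.0):
    """Generalization error for overlap R (|R| < 1) and teacher threshold a."""
    c = R / math.sqrt(1.0 - R * R)
    f = lambda v: gauss(v) * H(c * v)
    # The upper limit "infinity" is replaced by a + cutoff (an approximation).
    return 2.0 * (simpson(f, -a, 0.0) + simpson(f, a, a + cutoff))


def fundamental_minimum(a, steps=198):
    """Scan R = 0.005, 0.010, ..., 0.990 and return (R, error) with the smallest error."""
    grid = [0.005 * (i + 1) for i in range(steps)]
    return min(((R, gen_error(R, a)) for R in grid), key=lambda p: p[1])
```

For a = 0 the expression reduces to the familiar arccos(R)/π of a learnable teacher, which is a convenient sanity check. For a = 0.5 the scan places the minimum at an overlap close to R ≈ 0.9, that is, strictly between R = 0 and R = 1.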

Let us define the update rules of the on-line learning. The ensemble moving teachers
are updated independently from the current state $B_k^{m'}$ by using an input $x^{m'}$ and the output
$o^{m'}$ of the true teacher $A$ for that input, as

$$B_k^{m'+1} = B_k^{m'} + f_k^{m'}(x^{m'}, B_k^{m'}, o^{m'})\, x^{m'}, \tag{2.21}$$

where $f_k^{m'}$ is an update function of the ensemble moving teachers and $m'$ denotes the time
step of the ensemble moving teachers. In particular, we choose the perceptron learning for the
update function $f_k^{m'}$, which is given by

$$f_k^{m'} = \eta_B\, \Theta\left(-v_{B_k}^{m'} o^{m'}\right) o^{m'}. \tag{2.22}$$
Here, $\eta_B$ is the learning rate of the ensemble moving teachers. In our analysis, the learning rate
$\eta_B$ is independent of the teachers and is fixed during the learning process. After a sufficiently
long learning process using the perceptron rule, the ensemble moving teachers reach a steady
state with $R_{B_k}$, $l_{B_k}$ and $q_{kk'}$ fixed. In the present study, we focus our attention on the dynamical
effect of the ensemble teachers on the learning performance of the student. In order to separate
off a transient effect of the ensemble teachers, the student learns from the ensemble teachers
in the steady state. The student $J$ is updated using an input $x$ and an output of one of the $K$
ensemble moving teachers $B_k$ chosen randomly. The explicit recursion formula for $J^m$, with
$m$ being the time step of the student, is given by

$$J^{m+1} = J^m + g_k^m\left(x^m, J^m, \mathrm{sgn}(v_{B_k}^m l_{B_k}^m)\right) x^m, \tag{2.23}$$

where $g_k^m$ is an update function of the student and $k$ is a uniform random integer chosen
from 1 to $K$. Note that the ensemble moving teachers are also updated using the same input.
We particularly discuss two different learning rules for the student, which are the Hebbian
learning

$$g_k^m = \eta\, \mathrm{sgn}\left(v_{B_k}^m l_{B_k}^m\right), \tag{2.24}$$

and the perceptron learning

$$g_k^m = \eta\, \Theta\left(-v_{B_k}^m u^m\right) \mathrm{sgn}\left(v_{B_k}^m l_{B_k}^m\right). \tag{2.25}$$

The learning rate $\eta$ of the student is also constant during the learning process.
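
The update rules of this section can be sketched for finite-size vectors as follows. This is a minimal illustration rather than the simulation code used in the paper: the function names are hypothetical, vectors are plain lists, and it is assumed that the student reads the output of the chosen teacher before that teacher is updated on the same input.

```python
import random


def sgn(s):
    return 1 if s >= 0 else -1


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def teacher_step(B, x, o, eta_B):
    """Perceptron rule of eqs. (2.21)-(2.22): B moves only if it disagrees with o."""
    if sgn(dot(B, x)) != o:
        return [b + eta_B * o * xi for b, xi in zip(B, x)]
    return list(B)


def student_step(J, x, teacher_out, eta, rule="hebbian"):
    """Student update of eq. (2.23) with the Hebbian (2.24) or perceptron (2.25) rule."""
    if rule == "perceptron" and sgn(dot(J, x)) == teacher_out:
        return list(J)  # perceptron rule: no update when student and teacher agree
    return [j + eta * teacher_out * xi for j, xi in zip(J, x)]


def online_step(A, Bs, J, a, eta_B, eta, rule, rng):
    """One step with a shared input: the student follows a random teacher, teachers follow A."""
    N = len(A)
    x = [rng.gauss(0.0, 1.0 / N ** 0.5) for _ in range(N)]  # components with variance 1/N
    v = dot(A, x)
    o = sgn((v - a) * v * (v + a))   # true-teacher output, eq. (2.7)
    k = rng.randrange(len(Bs))       # teacher index chosen uniformly at random
    J_new = student_step(J, x, sgn(dot(Bs[k], x)), eta, rule)
    Bs_new = [teacher_step(B, x, o, eta_B) for B in Bs]
    return Bs_new, J_new
```

Keeping the teachers fixed corresponds to calling `online_step` with `eta_B = 0`, which leaves every teacher vector unchanged while the student continues to learn.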

3. Order-parameter theory 

As shown in the previous section, the generalization errors of the ensemble teachers and
the student are expressed in terms of the parameters $R_{B_k}$ and $R_J$, and evolve only through a few
parameters associated with the learning of $B_k$ and $J$ in the thermodynamic limit. It has been
shown that a class of on-line learning can be characterized by a few extensive parameters,
called order parameters. In this section, following ref. 3, a set of ordinary differential equations
for the order parameters is obtained for our model by taking the thermodynamic limit.

The learning process of the ensemble moving teachers is described by the three order
parameters $R_{B_k}$, $l_{B_k}$ and $q_{kk'}$, which are assumed to be self-averaging. It is sufficient to consider
the evolution of $R_{B_k}$ and $l_{B_k}$ in order to describe the dynamics of the ensemble teachers, but
that of the overlap $q_{kk'}$ between two different teachers is necessary for the student dynamics,
as seen later. From the update rule of the ensemble teachers in eq. (2.21), one finds a closed
set of ordinary differential equations for the order parameters as

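$$\frac{dl_B}{dt'} = \frac{\eta_B}{\sqrt{2\pi}}\left[R_B\left(2\exp\left(-\frac{a^2}{2}\right) - 1\right) - 1\right] + \frac{\eta_B^2}{l_B}\left(\int_{-a}^{0} + \int_{a}^{\infty}\right) Dv\, H\!\left(\frac{R_B v}{\sqrt{1 - R_B^2}}\right), \tag{3.1}$$

$$\frac{dR_B}{dt'} = -\frac{R_B}{l_B}\frac{dl_B}{dt'} + \frac{\eta_B}{\sqrt{2\pi}\, l_B}\left[2\exp\left(-\frac{a^2}{2}\right) - 1 - R_B\right], \tag{3.2}$$

$$\frac{dq}{dt'} = -\frac{2q}{l_B}\frac{dl_B}{dt'} + \frac{2\eta_B}{\sqrt{2\pi}\, l_B}\left[R_B\left(2\exp\left(-\frac{a^2}{2}\right) - 1\right) - q\right] + \frac{1}{l_B^2}\left[2\eta_B^2\left(\int_{-a}^{0} + \int_{a}^{\infty}\right) Dv \int_{-\infty}^{-R_B v/\sqrt{1 - R_B^2}} Dx\, H(z)\right], \tag{3.3}$$

where

$$z = \frac{(q - R_B^2)\,x + R_B\sqrt{1 - R_B^2}\,v}{\sqrt{(1 - q)(1 + q - 2R_B^2)}} \tag{3.4}$$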

and $t'$ denotes the continuous time. We omit the subscript $k$ from the order parameters because
the differential equations, including their initial conditions, are symmetric under permutations
of the subscript $k$. The derivation of the differential equations is given in Appendix A.
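
The approach of the ensemble teachers to their steady state can be illustrated by integrating the flow of $l_B$ and $R_B$ numerically. The sketch below is a minimal forward-Euler integration under stated assumptions: it takes the perceptron-rule flow in the form dl_B/dt' = ⟨f v_B⟩ + ⟨f²⟩/(2 l_B) and dR_B/dt' = -(R_B/l_B) dl_B/dt' + ⟨f v⟩/l_B, with ⟨f v_B⟩ = η_B[R_B(2e^{-a²/2} - 1) - 1]/√(2π), ⟨f v⟩ = η_B(2e^{-a²/2} - 1 - R_B)/√(2π) and ⟨f²⟩ equal to η_B² times the teacher generalization error. The step size, time horizon and quadrature resolution are arbitrary choices for this example.

```python
import math


def H(u):
    return 0.5 * math.erfc(u / math.sqrt(2.0))


def simpson(f, lo, hi, n):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0


def S(R, a):
    """(int_{-a}^{0} + int_{a}^{inf}) Dv H(R v / sqrt(1 - R^2)), i.e. half the teacher error."""
    c = R / math.sqrt(1.0 - R * R)
    f = lambda v: math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi) * H(c * v)
    return simpson(f, -a, 0.0, 40) + simpson(f, a, a + 8.0, 160)


def teacher_flow(a, eta_B=1.0, t_max=80.0, dt=0.05):
    """Forward-Euler integration from l_B = 1, R_B = 0; returns (l_B, R_B) at t_max."""
    l, R = 1.0, 0.0
    c = 2.0 * math.exp(-0.5 * a * a) - 1.0
    k = eta_B / math.sqrt(2.0 * math.pi)
    for _ in range(int(round(t_max / dt))):
        dl = k * (R * c - 1.0) + eta_B ** 2 * S(R, a) / l  # <f v_B> + <f^2> / (2 l)
        dR = -R / l * dl + k * (c - R) / l                 # -(R/l) dl/dt' + <f v> / l
        l, R = l + dt * dl, R + dt * dR
    return l, R
```

Calling `teacher_flow(0.5)` drives the pair (l_B, R_B) from the random initial condition towards a fixed point at which both derivatives vanish, which is the steady state used as the starting point for the student.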

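From these equations one easily obtains the steady-state solutions of $R_B$, $l_B$ and $q$ as follows:

$$R_B = 2\exp\left(-\frac{a^2}{2}\right) - 1, \tag{3.5}$$

$$l_B = \frac{\sqrt{2\pi}\,\eta_B}{1 - R_B^2}\left(\int_{-a}^{0} + \int_{a}^{\infty}\right) Dv\, H\!\left(\frac{R_B v}{\sqrt{1 - R_B^2}}\right), \tag{3.6}$$

$$q = R_B^2 + \left(1 - R_B^2\right) \frac{\left(\int_{-a}^{0} + \int_{a}^{\infty}\right) Dv \int_{-\infty}^{-R_B v/\sqrt{1 - R_B^2}} Dx\, H(z)}{\left(\int_{-a}^{0} + \int_{a}^{\infty}\right) Dv\, H\!\left(\frac{R_B v}{\sqrt{1 - R_B^2}}\right)}. \tag{3.7}$$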

Note that $R_B$, $q$ and $l_B/\eta_B$ depend only on the threshold $a$ of the true teacher. In our study,
the ensemble teachers are assumed to have reached the steady state before the student begins to
learn, in order to make the dynamical effect of the ensemble teachers clear. Therefore, these
solutions for $R_B$, $l_B$ and $q$ are used as the initial condition for the learning dynamics of the
student discussed below.
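
The dependence of the steady state on the threshold alone can be made concrete with a short numerical sketch. It assumes the steady-state relations R_B = 2 exp(-a²/2) - 1 and l_B/η_B = √(2π) S/(1 - R_B²), where S denotes (∫ from -a to 0 + ∫ from a to ∞) Dv H(R_B v/√(1 - R_B²)); the function names and the integration cutoff are choices made for this example, and q is not computed here.

```python
import math


def H(u):
    return 0.5 * math.erfc(u / math.sqrt(2.0))


def simpson(f, lo, hi, n=400):
    """Composite Simpson rule with an even number n of sub-intervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0


def steady_state(a):
    """Return (R_B, l_B / eta_B, teacher generalization error) for a threshold a > 0."""
    R = 2.0 * math.exp(-0.5 * a * a) - 1.0
    c = R / math.sqrt(1.0 - R * R)  # requires a > 0 so that R < 1
    f = lambda v: math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi) * H(c * v)
    S = simpson(f, -a, 0.0) + simpson(f, a, a + 8.0)  # infinity replaced by a + 8
    return R, math.sqrt(2.0 * math.pi) * S / (1.0 - R * R), 2.0 * S
```

None of the three returned quantities involves the learning rate separately, in line with the remark that only the ratio $l_B/\eta_B$ is fixed by the threshold.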

The learning dynamics of the student is also described by a set of ordinary differential
equations for a few order parameters, which are derived from the update functions for the Hebbian
rule (2.24) and the perceptron rule (2.25). We refer to Appendix B for the derivation of
the dynamical equations. A straightforward calculation for the Hebbian rule leads to



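$$\frac{dl_J}{dt} = \eta\sqrt{\frac{2}{\pi}}\, R_{BJ} + \frac{\eta^2}{2 l_J}, \tag{3.8}$$

$$\frac{dR_J}{dt} = -\frac{R_J}{l_J}\frac{dl_J}{dt} + \frac{\eta}{l_J}\sqrt{\frac{2}{\pi}}\, R_B, \tag{3.9}$$

$$\frac{dR_{BJ}}{dt} = -\frac{R_{BJ}}{l_J}\frac{dl_J}{dt} + \frac{\eta_B}{\sqrt{2\pi}\, l_B}\left[R_J\left(2\exp\left(-\frac{a^2}{2}\right) - 1\right) - R_{BJ}\right] + \frac{\eta}{l_J}\sqrt{\frac{2}{\pi}}\,\frac{1 + (K-1)q}{K} - \frac{2\eta\eta_B}{K\, l_J l_B}\left(\int_{-a}^{0} + \int_{a}^{\infty}\right) Dv \left[H\!\left(\frac{R_B v}{\sqrt{1 - R_B^2}}\right) + (K-1)\int_{-\infty}^{-R_B v/\sqrt{1 - R_B^2}} Dx\, \{2H(z) - 1\}\right]. \tag{3.10}$$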



The corresponding differential equations for the perceptron rule, eqs. (3.11)-(3.13), are obtained
in the same way by inserting eqs. (B-12)-(B-18) into eqs. (B-1)-(B-3).



Solving these differential equations for the student and the ensemble teachers, we obtain
$R_J$ and the generalization error $\epsilon_J$ as functions of the time step.
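
For the simplest case, a Hebbian student learning from teachers that are frozen in their steady state, the flow can be integrated in a few lines. The sketch below is an illustration under explicit assumptions: with frozen teachers all terms proportional to η_B are dropped, and the flow is taken as dl/dt = η√(2/π) R_BJ + η²/(2l), dR_J/dt = -(R_J/l) dl/dt + (η/l)√(2/π) R_B and dR_BJ/dt = -(R_BJ/l) dl/dt + (η/l)√(2/π)[1 + (K-1)q]/K. The default values R_B = 0.765 and q = 0.91 approximate the steady state quoted for a = 0.5, and the function names are hypothetical.

```python
import math


def hebbian_fixed_teachers(K, eta, R_B=0.765, q=0.91, t_max=200.0, dt=0.01):
    """Forward-Euler flow of (l, R_J, R_BJ) for a Hebbian student and frozen teachers."""
    s = math.sqrt(2.0 / math.pi)
    c_K = (1.0 + (K - 1) * q) / K  # mean overlap of one teacher with a randomly chosen teacher
    l, R_J, R_BJ = 1.0, 0.0, 0.0   # student starts orthogonal to A and to the teachers
    for _ in range(int(round(t_max / dt))):
        dl = eta * s * R_BJ + eta ** 2 / (2.0 * l)
        dR_J = -R_J / l * dl + eta * s * R_B / l
        dR_BJ = -R_BJ / l * dl + eta * s * c_K / l
        l, R_J, R_BJ = l + dt * dl, R_J + dt * dR_J, R_BJ + dt * dR_BJ
    return l, R_J, R_BJ


def asymptotic_R_J(K, R_B=0.765, q=0.91):
    """Large-time limit R_B / sqrt(c_K) implied by the flow above."""
    return R_B / math.sqrt((1.0 + (K - 1) * q) / K)
```

Under these assumptions the large-time value of $R_J$ equals $R_B$ for K = 1 and exceeds $R_B$ for K > 1, because averaging over several imperfectly correlated teachers (q < 1) cancels part of their individual deviations from the true teacher.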

4. Results and Discussion 

In this section we present the dynamical behavior of the order parameter $R_J$ and the
generalization error $\epsilon_J$ obtained by numerically solving the set of differential equations derived
in the previous section. In order to study the "dynamical" effect of the ensemble teachers, we
compare the results of two different cases: one with the teachers fixed in a steady state, and the
other with the teachers continuing to learn in the steady state, sharing the same inputs with the
student. In this study, we choose the threshold value $a = 0.5$ of the non-monotonic perceptron
for the true teacher, yielding $l_B/\eta_B \simeq 0.93$, $R_B \simeq 0.76$ and $q \simeq 0.91$ in the steady state
of the ensemble teachers. We also perform direct simulations of the given update rules for
finite-size perceptrons. In the simulations we use the dimension of vectors $N$ = 10'^ and
perform 10^ trajectories of the learning process for taking the average over the random inputs.
Although only the limited case with $\eta = 0.1$ is shown in the figures below to avoid
crowded plots, the results for $R_J$ and $\epsilon_J$ obtained by the simulations for all the parameters
studied agree with the theoretical ones obtained from the order-parameter differential equations.
This confirms that the assumption of self-averaging is appropriate in our model.

Figure 1 shows the time dependence of $R_J$ for the Hebbian learning when the ensemble teachers
stop learning and are fixed at a steady-state vector. The transient process of $R_J$ depends on the
learning rate $\eta$ of the student and the number $K$ of the ensemble teachers. The value of $R_J$
gets larger with increasing $K$ and $\eta$, meaning that the student
comes closer to the true teacher. As the time $t$ goes on, $R_J$ monotonically approaches a steady
value, which increases as $K$ increases. Interestingly, the steady value of $R_J$ exceeds the value
of $R_B$ when the number $K$ of the ensemble teachers is greater than 1. This is similar to the
result shown in ref. 8. Figure 2, on the other hand, shows the corresponding time dependence of
$R_J$ when the ensemble teachers continue to learn in their steady state. While at the very
beginning of the learning process $R_J$ shows a monotonic time development similar
to that in the case of fixed ensemble teachers, it becomes larger than that with the fixed teachers
after a certain time and eventually approaches unity, independently of the learning
rate, even if the number $K$ is one. It should be noted that the value of $R_B$ is common to the two
cases of Figs. 1 and 2. This implies that increasing the number $K$ of the ensemble teachers is not
efficient for the learning of the student, whereas their continuous learning, even with a fixed
similarity to the true teacher, is significantly important.

Figure 3 shows the dynamical behavior of the generalization error of the student for the
Hebbian learning when the ensemble teachers are fixed; it monotonically decreases and eventually
converges to a steady value. The steady value of $\epsilon_J$ depends only on the number $K$
and not on the learning rate $\eta$. As $K$ increases, the value decreases, and furthermore it can be
smaller than the generalization error of the ensemble teachers when $K$ is larger than
one, reflecting the behavior of $R_J$. This means that the performance of the student becomes
better than that of the ensemble teachers when $K \ge 2$. The obtained value of $\epsilon_J$, however, does not
reach the fundamental minimum value of the generalization error in this case even when $K$
increases to infinity. In Fig. 4 the dynamical behavior of $\epsilon_J$ is shown for the case where the
ensemble teachers are moving. In contrast to the case of the fixed ensemble teachers, $\epsilon_J$ shows
non-monotonic behavior in the learning process, and its steady value is independent of both
$K$ and $\eta$, while it is considerably larger than $\epsilon_B$. The minimum value of $\epsilon_J$ reaches the fundamental
minimum value at a certain time step, which depends on the learning rate $\eta$. In this sense, the mobile
ensemble teachers provide a better on-line learning model, although unfortunately the best
performance occurs only in a transient state.

Let us turn to the perceptron learning of the student. We show the time development
of $R_J$ for the fixed and moving ensemble teachers in Figs. 5 and 6, respectively. The steady
values of $R_J$ coincide with $R_B$ in both cases and are independent of $K$ and $\eta$.
Further, non-monotonic behavior is found for small $\eta$ and large $K$, and then $R_J$
takes a maximum value at a certain time step, which exceeds $R_B$ as a transient state.
Moving the ensemble teachers significantly enhances the maximum value, meaning that the
student comes closer to the true teacher. In particular, for small values of $\eta$ the maximum value of
$R_J$ for a single moving teacher is larger than that for the $K = \infty$ fixed ensemble teachers.

Figures 7 and 8 show the corresponding dynamical behavior of the generalization error $\epsilon_J$
of the perceptron-learning student with the fixed and mobile ensemble teachers, respectively.
As expected from the behavior of $R_J$ in Figs. 5 and 6, the steady value of $\epsilon_J$ in all
cases is the same as that of the ensemble teachers. However, an essential difference is found
in the transient behavior of $\epsilon_J$. Although the minimum value does not necessarily achieve the
fundamental minimum value of $\epsilon_J$ in the case of the fixed ensemble teachers, it does so for small
values of $\eta$ in the case of the moving ensemble teachers, over a finite time interval, as shown
in Fig. 8. This means again that moving the ensemble teachers plays an important role in the
learning performance of the student.
Fig. 1. Time dependence of the direction cosine $R_J$ between the student $J$ with the Hebbian learning
and the true teacher $A$ with $a = 0.5$ in the case that the $K$ ensemble teachers are fixed at
a steady-state vector. Curves represent numerical solutions of the order-parameter differential
equations with $K$ = 1, 50 and $\infty$ and $\eta$ = 0.1, 1 and 10. The straight line is the direction cosine
$R_B$ between the fixed ensemble teachers and the true teacher. Symbols represent the corresponding
results obtained by direct simulation with system size N = 10* and $\eta = 0.1$.

Fig. 2. Time dependence of the direction cosine $R_J$ between the student $J$ with the Hebbian learning
and the true teacher $A$ with $a = 0.5$ in the case that the ensemble teachers continue to learn in
their steady state. The symbols and the lines are the same as those in Fig. 1.

Fig. 3. Time dependence of the generalization error $\epsilon_J$ of the student $J$ with the Hebbian learning
with respect to the true teacher $A$ with $a = 0.5$ in the case of the fixed ensemble teachers.
The symbols and the lines are the same as those in Fig. 1.

Fig. 4. Time dependence of the generalization error $\epsilon_J$ of the student $J$ with the Hebbian learning
with respect to the true teacher $A$ with $a = 0.5$ in the case of the mobile ensemble teachers.
The symbols and the lines are the same as those in Fig. 1.

Fig. 5. Time dependence of the direction cosine $R_J$ between the student $J$ with the perceptron
learning and the true teacher $A$ with $a = 0.5$ in the case of the fixed ensemble teachers. The
symbols and the lines are the same as those in Fig. 1.

Fig. 6. Time dependence of the direction cosine $R_J$ between the student $J$ with the perceptron
learning and the true teacher $A$ with $a = 0.5$ in the case of the mobile ensemble teachers. The
symbols and the lines are the same as those in Fig. 1.

Fig. 7. Time dependence of the generalization error $\epsilon_J$ of the student $J$ with the perceptron learning
with respect to the true teacher $A$ with $a = 0.5$ in the case of the fixed ensemble teachers.
The symbols and the lines are the same as those in Fig. 1.

Fig. 8. Time dependence of the generalization error $\epsilon_J$ of the student $J$ with the perceptron learning
with respect to the true teacher $A$ with $a = 0.5$ in the case of the mobile ensemble teachers.
The symbols and the lines are the same as those in Fig. 1.

5. Conclusion

We have analyzed the generalization performance of a student supervised by ensemble
moving teachers in the framework of on-line learning. In this paper we adopted a non-monotonic
perceptron as the true teacher and simple perceptrons as the ensemble moving
teachers and the student. We have treated the Hebbian learning and the perceptron learning
as learning rules for the student and have calculated the generalization error of the student
in terms of a few order parameters analytically or numerically. In this study, we particularly
focused on the effect of mobile ensemble teachers on the learning performance of the student.
Therefore, it was assumed that the ensemble teachers learn only from the true teacher by using the
perceptron learning and reach a steady state before the student begins to learn. This is helpful
for separating a transient learning effect of the ensemble teachers from an intrinsic effect.

In the Hebbian learning, it has been shown that increasing the number $K$ of the ensemble
teachers is not efficient, but that their continuous learning in their steady state is significantly
important for the student to come close to the true teacher. In the case that the ensemble teachers
continue to learn, the value of $R_J$ eventually approaches unity, independently of the learning
rate, even if the number $K$ is one. Although $R_J = 1$ does not necessarily mean
the best learning performance in the Hebbian learning, the minimum value of $\epsilon_J$ reaches the
fundamental minimum value in a transient state, regardless of the number $K$. This is in sharp
contrast to the case of the fixed ensemble teachers, in which the fundamental minimum value
of $\epsilon_J$ is never reached. The time step at which $\epsilon_J$ takes its minimum value decreases with increasing
learning rate $\eta$, but its precise value has not been predicted theoretically at present.

In the perceptron learning, in contrast to the Hebbian learning, no significant difference
has been found in the steady states. The steady values of $R_J$ and $\epsilon_J$ coincide with those of
$R_B$ and $\epsilon_B$ for both the fixed and the mobile ensemble teachers. However, the effect of the
movement of the ensemble teachers appears in the transient state of the learning process,
where, in particular for small values of the learning rate $\eta$, the maximum value of $R_J$
exceeds the value of $R_B$ and the minimum value of $\epsilon_J$ reaches the fundamental minimum
value even if the number $K$ is one. In the case of the fixed ensemble teachers, while the former
is found only for large $K$ and small $\eta$, the latter is hardly seen for any parameter examined.
It is interesting that the results for the mobile ensemble teachers depend only weakly
on the number of the ensemble teachers. Further, the minimum value of $\epsilon_J$ for the $K = 1$
mobile ensemble teacher is smaller than that for the $K = \infty$ fixed ensemble teachers. Our study
suggests that the movement of the ensemble teachers, rather than the number $K$, is important
for the student learning in our model.

One of the drawbacks of the present model is that the minimum of $\epsilon_J$ is attained only in a
transient state of the learning process and that no algorithm is known for stopping the learning at
that transient state. We point out that, in the perceptron learning, the transient state that gives
the minimum of $\epsilon_J$ lasts for a finite time interval, as shown in Fig. 8. This might be convenient
in comparison with the Hebbian learning, but the explicit construction of a stopping algorithm,
including a practical implementation, remains to be solved in future work.
Appendix A: Derivation of the learning dynamics for the ensemble teachers

In this appendix, we derive the set of ordinary differential equations (3.1), (3.2) and (3.3) for
the order parameters of the ensemble moving teachers in our model. From the update rule
of the ensemble teachers in eq. (2.21), a standard calculation leads to the following ordinary
differential equations in terms of averages over the correlated Gaussian variables,





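$$\frac{dl_{B_k}}{dt'} = \langle f_k v_{B_k} \rangle + \frac{\langle f_k^2 \rangle}{2 l_{B_k}}, \tag{A-1}$$

$$\frac{dR_{B_k}}{dt'} = -\frac{R_{B_k}}{l_{B_k}}\frac{dl_{B_k}}{dt'} + \frac{\langle f_k v \rangle}{l_{B_k}}, \tag{A-2}$$

$$\frac{dq_{kk'}}{dt'} = -\frac{q_{kk'}}{l_{B_k}}\frac{dl_{B_k}}{dt'} - \frac{q_{kk'}}{l_{B_{k'}}}\frac{dl_{B_{k'}}}{dt'} + \frac{\langle f_k v_{B_{k'}} \rangle}{l_{B_k}} + \frac{\langle f_{k'} v_{B_k} \rangle}{l_{B_{k'}}} + \frac{\langle f_k f_{k'} \rangle}{l_{B_k} l_{B_{k'}}}, \tag{A-3}$$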
where the continuous time $t'$ is defined by the thermodynamic limit of $m'/N$, with $m'$ being
the time step of the ensemble teachers in eq. (2.21). The bracket $\langle \cdots \rangle$ denotes the average
with respect to the multivariate Gaussian distribution given in eq. (2.15). Since each component
of $A$ and $B_k^0$ is generated independently from the Gaussian distribution, $A$ and $B_k^0$ with
any $k$ are orthogonal to each other in the thermodynamic limit. Then, the initial conditions
of the differential equations for $R_{B_k}$ and $q_{kk'}$ are given by

$$R_{B_k}^0 = 0, \quad q_{kk'}^0 = 0. \tag{A-4}$$



One easily finds from eqs. (A-1)-(A-3) and (A-4) that the order parameters $R_{B_k}$, $l_{B_k}$ and
$q_{kk'}$ are invariant under a permutation of the index $k$ of the ensemble teachers. Because of
this symmetry, we omit the subscripts $k$ from the order parameters. We can calculate the sample
averages in eqs. (A-1)-(A-3) and obtain



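$$\langle f_k v_{B_k} \rangle = \frac{\eta_B}{\sqrt{2\pi}}\left[R_B\left(2\exp\left[-\frac{a^2}{2}\right] - 1\right) - 1\right], \tag{A-5}$$

$$\langle f_k^2 \rangle = 2\eta_B^2\left(\int_{-a}^{0} + \int_{a}^{\infty}\right) Dv\, H\!\left(\frac{R_B v}{\sqrt{1 - R_B^2}}\right), \tag{A-6}$$

$$\langle f_k v \rangle = \frac{\eta_B}{\sqrt{2\pi}}\left[2\exp\left[-\frac{a^2}{2}\right] - 1 - R_B\right], \tag{A-7}$$

$$\langle f_{k'} v_{B_k} \rangle = \frac{\eta_B}{\sqrt{2\pi}}\left[R_B\left(2\exp\left[-\frac{a^2}{2}\right] - 1\right) - q\right], \tag{A-8}$$

$$\langle f_k f_{k'} \rangle = 2\eta_B^2\left(\int_{-a}^{0} + \int_{a}^{\infty}\right) Dv \int_{-\infty}^{-R_B v/\sqrt{1 - R_B^2}} Dx\, H(z), \tag{A-9}$$

where

$$z = \frac{(q - R_B^2)\,x + R_B\sqrt{1 - R_B^2}\,v}{\sqrt{(1 - q)(1 + q - 2R_B^2)}}. \tag{A-10}$$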

Substituting them into eqs. (A-1), (A-2) and (A-3), the differential equations (3.1), (3.2) and 
(3.3) are derived. 
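
Closed-form Gaussian averages of this kind are easy to check by sampling. The sketch below is a minimal Monte Carlo illustration: it draws correlated pairs (v, v_B) with ⟨v v_B⟩ = R, applies the perceptron update function f = η_B Θ(-v_B o) o, and compares the sample means of f v_B and f v with the closed forms assumed here for eqs. (A-5) and (A-7), namely η_B[R(2e^{-a²/2} - 1) - 1]/√(2π) and η_B(2e^{-a²/2} - 1 - R)/√(2π). The function names, the sample size and the seed are arbitrary choices for this example.

```python
import math
import random


def sgn(s):
    return 1 if s >= 0 else -1


def mc_averages(R, a, eta_B=1.0, samples=200000, seed=0):
    """Monte Carlo estimates of <f v_B> and <f v> for the perceptron update function f."""
    rng = random.Random(seed)
    s = math.sqrt(1.0 - R * R)
    sum_fvB = 0.0
    sum_fv = 0.0
    for _ in range(samples):
        v = rng.gauss(0.0, 1.0)
        vB = R * v + s * rng.gauss(0.0, 1.0)    # unit variance, correlation R with v
        o = sgn((v - a) * v * (v + a))          # non-monotonic teacher output
        f = eta_B * o if -vB * o >= 0 else 0.0  # f = eta_B * Theta(-vB * o) * o
        sum_fvB += f * vB
        sum_fv += f * v
    return sum_fvB / samples, sum_fv / samples


def closed_forms(R, a, eta_B=1.0):
    """Closed forms assumed for <f v_B> and <f v>."""
    c = 2.0 * math.exp(-0.5 * a * a) - 1.0
    k = eta_B / math.sqrt(2.0 * math.pi)
    return k * (R * c - 1.0), k * (c - R)
```

The second closed form vanishes when R equals 2 exp(-a²/2) - 1, which is the condition that determines the steady-state direction cosine of the ensemble teachers.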

Appendix B: Derivation of the learning dynamics for the student 

As in the appendix A, a set of the differential equations for the student dynamics is derived 
in this appendix. From the update rule (2.23) of the student, the standard calculus again leads 
to the following equations: 



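$$\frac{dl_J}{dt} = \frac{1}{K}\sum_{k=1}^{K}\left(\langle g_k u \rangle + \frac{\langle g_k^2 \rangle}{2 l_J}\right), \tag{B-1}$$

J. Phys. Soc. Jpn. Full Paper

$$\frac{dR_J}{dt} = -\frac{R_J}{l_J}\frac{dl_J}{dt} + \frac{1}{K}\sum_{k=1}^{K}\frac{\langle g_k v \rangle}{l_J}, \tag{B-2}$$

$$\frac{dR_{B_k J}}{dt} = -\frac{R_{B_k J}}{l_J}\frac{dl_J}{dt} - \frac{R_{B_k J}}{l_{B_k}}\frac{dl_{B_k}}{dt} + \frac{\langle f_k u \rangle}{l_{B_k}} + \frac{1}{K}\sum_{k'=1}^{K}\left(\frac{\langle g_{k'} v_{B_k} \rangle}{l_J} + \frac{\langle f_k g_{k'} \rangle}{l_J l_{B_k}}\right), \tag{B-3}$$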

where $t$ denotes the continuous time defined by $t = m/N$. As the initial conditions of eqs. (B-2)
and (B-3), we take

$$R_J = 0, \quad R_{B_k J} = 0, \tag{B-4}$$



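since $A$, $B_k^0$ and $J^0$ are orthogonal to each other in the thermodynamic limit. It is shown
from eqs. (B-4) and (B-3) that the order parameter $R_{B_k J}$ does not depend on the index $k$.
Then, one can omit the subscript $k$ from the order parameter without loss of generality.
Substituting the update functions $g$ of the Hebbian and the perceptron learning, respectively,
one calculates the Gaussian averages in eqs. (B-1)-(B-3). In the case of the Hebbian
learning, they are

$$\langle g_k u \rangle = \eta\sqrt{\frac{2}{\pi}}\, R_{BJ}, \tag{B-5}$$

$$\langle g_k^2 \rangle = \eta^2, \tag{B-6}$$

$$\langle g_k v \rangle = \eta\sqrt{\frac{2}{\pi}}\, R_B, \tag{B-7}$$

$$\langle f_k u \rangle = \frac{\eta_B}{\sqrt{2\pi}}\left[R_J\left(2\exp\left(-\frac{a^2}{2}\right) - 1\right) - R_{BJ}\right], \tag{B-8}$$

$$\langle g_{k'} v_{B_k} \rangle = \eta\sqrt{\frac{2}{\pi}}\left[q + (1 - q)\,\delta_{k,k'}\right], \tag{B-9}$$

$$\langle f_k g_{k'} \rangle = -2\eta\eta_B\left(\int_{-a}^{0} + \int_{a}^{\infty}\right) Dv \int_{-\infty}^{-R_B v/\sqrt{1 - R_B^2}} Dx\, \{2H(z) - 1\} \quad (k \ne k'), \tag{B-10}$$

$$\langle f_k g_k \rangle = -2\eta\eta_B\left(\int_{-a}^{0} + \int_{a}^{\infty}\right) Dv\, H\!\left(\frac{R_B v}{\sqrt{1 - R_B^2}}\right), \tag{B-11}$$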

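and in the case of the perceptron learning as

$$\langle g_k u \rangle = \frac{\eta}{\sqrt{2\pi}}\left(R_{BJ} - 1\right), \tag{B-12}$$

$$\langle g_k^2 \rangle = \frac{\eta^2}{\pi}\tan^{-1}\left(\frac{\sqrt{1 - R_{BJ}^2}}{R_{BJ}}\right), \tag{B-13}$$

$$\langle g_k v \rangle = \frac{\eta}{\sqrt{2\pi}}\left(R_B - R_J\right), \tag{B-14}$$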
Here, $z_1$ and $z_2$ are defined as

$$z_1 = \frac{(R_{BJ} - R_B R_J)\left[\sqrt{1 - q}\,y + \sqrt{1 + q - 2R_B^2}\,x\right] + R_J\sqrt{(1 - R_B^2)(1 + q - 2R_B^2)}\,v}{\sqrt{(1 - R_B^2)\left\{(1 + q)(1 - R_J^2) - 2\left(R_B^2 - 2R_B R_J R_{BJ} + R_{BJ}^2\right)\right\}}} \tag{B-19}$$

and

$$z_2 = \frac{(R_{BJ} - R_B R_J)\,x + R_J\sqrt{1 - R_B^2}\,v}{\sqrt{1 - R_{BJ}^2 - R_B^2 - R_J^2 + 2R_B R_J R_{BJ}}}, \tag{B-20}$$

and $\delta_{k,k'}$ is the Kronecker delta defined by

$$\delta_{k,k'} = \begin{cases} +1, & k = k', \\ 0, & k \ne k'. \end{cases} \tag{B-21}$$

Inserting (B-5)-(B-11) and (B-12)-(B-18) into (B-1)-(B-3) gives the dynamical equations
(3.8)-(3.10) for the Hebbian rule and (3.11)-(3.13) for the perceptron rule, respectively.